GH-48868: [Doc] Document security model for the Arrow formats #48870

Merged: pitrou merged 2 commits into apache:main from pitrou:gh48868-format-security-model on Feb 5, 2026

pitrou (Member) commented Jan 15, 2026

Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications. This is an attempt at documenting those.

What changes are included in this PR?

Add a Security Considerations page in the Format section.

Doc preview: https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

Are these changes tested?

N/A

Are there any user-facing changes?

No.

pitrou (Member Author) commented Jan 15, 2026

@github-actions crossbow submit preview-docs

@github-actions

Revision: 593babb

Submitted crossbow builds: ursacomputing/crossbow @ actions-4f7018459b

Task Status
preview-docs GitHub Actions

raboof (Member) left a comment

Looks reasonable (without any particular Arrow expertise)

(noticed two typos)

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Jan 15, 2026

uninitialized in a buffer if the array might be sent to, or read by, an untrusted
third-party, even when the uninitialized data is logically irrelevant. The
easiest way to do this, though perhaps not the most efficient, is to zero-initialize
any buffer that will not be populated in full.
Contributor

Worth pointing out something about query engines and dataframe libraries deciding to not do so for internal/intermediate values in computations but applying a canonicalization pass when data leaves the system.

Contributor

Perhaps we can emphasize that all bytes in an Arrow array, regardless of whether they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (like uninitialized values).

And therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system.
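
A minimal sketch of such a clearing pass for a fixed-width array, in plain Python. The function name `clear_unreachable_bytes` and the use of `bytes`/`bytearray` as stand-ins for Arrow buffers are assumptions made for illustration; this is not an API of any Arrow library. It zeroes the value bytes under null slots and any trailing padding, so that leftover memory contents do not leave the process.

```python
def clear_unreachable_bytes(validity: bytes, values: bytearray,
                            length: int, width: int) -> None:
    """Zero, in place, every value byte that is not logically reachable.

    validity: Arrow-style bitmap (least-significant bit first, 1 = valid).
    values:   buffer holding `length` slots of `width` bytes, plus optional padding.
    """
    for i in range(length):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if not is_valid:
            # The slot is null: its bytes are logically irrelevant, so clear them.
            values[i * width:(i + 1) * width] = bytes(width)
    # Padding past the last slot is also readable by a consumer.
    used = length * width
    values[used:] = bytes(len(values) - used)


# Three int32 slots; slot 1 is null and the buffer has 4 bytes of padding,
# both still holding leftover data.
values = bytearray(b"\x01\x00\x00\x00" b"\xde\xad\xbe\xef"
                   b"\x03\x00\x00\x00" b"\xaa\xaa\xaa\xaa")
validity = bytes([0b00000101])
clear_unreachable_bytes(validity, values, length=3, width=4)
```

An engine that skips zero-initialization internally could run a pass of this kind only at the boundary where arrays are handed to another system.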

from an untrusted source (for example because you are writing a proxy to
an arbitrary third-party service), it is **recommended** that you validate
the data first, as the consumer may assume that the data is valid already.

Contributor

Suggested change
In addition to invalid pointers, some array types have offsets, sizes, and buffer indices that might be out-of-bounds. The library producing arrays through the
C data interface might be performing only very light validation of these values.
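
As a rough illustration of the kind of check this refers to, here is a sketch that validates the int32 offsets of a variable-size binary layout against the size of the data buffer. The helper name `check_binary_offsets` and its error strings are hypothetical, and Python `bytes` objects stand in for the raw buffers; a real implementation would run an equivalent check in its own language.

```python
import struct
from typing import Optional


def check_binary_offsets(offsets_buf: bytes, data_len: int,
                         length: int) -> Optional[str]:
    """Return an error message, or None if the offsets look consistent."""
    if length < 0:
        return "negative array length"
    if len(offsets_buf) < 4 * (length + 1):
        return "offsets buffer too small"
    # Arrow offsets are little-endian signed 32-bit integers here.
    offsets = struct.unpack_from(f"<{length + 1}i", offsets_buf)
    if offsets[0] < 0:
        return "negative first offset"
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            return "offsets are not monotonically non-decreasing"
    if offsets[-1] > data_len:
        return "last offset exceeds data buffer length"
    return None


good = struct.pack("<4i", 0, 3, 3, 5)
bad = struct.pack("<3i", 0, 4, 2)
print(check_binary_offsets(good, data_len=5, length=3))
print(check_binary_offsets(bad, data_len=10, length=2))
```

Note that this check is only meaningful when the real sizes of the buffers are known, which is the case for IPC data but not for a bare C Data Interface structure.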

github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label on Jan 15, 2026

alamb (Contributor) left a comment

Thank you @pitrou -- this is much needed and very helpful

I had some suggestions on structure. Hopefully they are helpful

Advice for users
''''''''''''''''

If you receive Arrow in-memory data from an untrusted source, it is
Contributor

I suggest we also make the point about performance here to give context about why
validation is not always performed

Perhaps something like this:

"Arrow implementations often assume that Arrays follow the specification
in order to provide high-speed processing. It is extremely important that
your application either trusts or validates the Arrays it receives from
other sources.

Many Arrow implementations provide APIs to do such validation."

In terms of APIs, the Rust implementation always validates data from external sources, unless the validation is explicitly turned off with APIs marked as unsafe (a special Rust keyword).

''''''''''''''''

If you produce a C Data Interface structure for data that nevertheless comes
from an untrusted source (for example because you are writing a proxy to
Contributor

I don't think this is any different than the other APIs -- basically "if you don't trust the producer source, you should always explicitly validate the arrays before processing them"

This doesn't seem any different for the C Data Interface than for the other APIs (like IPC files, etc.)

Member Author

Hmm, that's true. I might just remove it if it muddies the message.

a trusted producer, for the reason explained above. However, it is still **recommended**
that you validate it for soundness, as a trusted producer can have bugs anyway.

IPC Format
Contributor

As above, I think we could combine this into the section about validating data from untrusted sources, and give C Data Interface and IPC Format as examples of potentially untrusted sources.

Member Author

"This" means the IPC format section?

Contributor

Yes -- I was thinking that if the guidance is the same for IPC and C Data Interface, calling them out separately makes this documentation more verbose than it could be

The high level overview in my mind is:

  1. If you read Arrow data (in any format) from an untrusted source, it can be a memory and security concern. Thus you should always verify such data.
  2. Implementers should provide a way to let users opt in / out of the verification depending on the source and their security threat model.

If there are things to check that are specific to particular formats, such as the FlatBuffers offsets for IPC-based formats, then we could call them out in those sections.

I am happy to propose some new wording if you like

Member Author

Well, the guidance is not the same. For the C Data Interface, it is simply impossible (to my knowledge) to protect against rogue raw pointers. For IPC, validation is possible to ensure safe operation.

Member

Yes, there's no way to meaningfully validate an ArrowArray other than perhaps to check pointers for unexpected NULL. This is because the ArrowArray does not provide buffer lengths (the consumer has to infer those based on the ArrowSchema and/or the buffer data of previous buffers). This is roughly the same as receiving an Arrow C++ or arrow-rs array from another library: the consumer has to assume it was correctly produced to avoid a crash, and a malicious consumer can always attempt to read byte ranges it isn't supposed to, based on the buffer pointers.
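
To make the inference concrete, here is a sketch of how a consumer might compute the minimum size each buffer must have from the schema format string and the array's `length` and `offset`. The function name `expected_buffer_sizes` is hypothetical, only the `"i"` (int32) and `"u"` (utf8) format strings are handled, and Python `bytes` objects stand in for the raw buffer pointers, which in the real C structures carry no length at all.

```python
import struct


def expected_buffer_sizes(fmt: str, length: int, offset: int,
                          buffers: list) -> list:
    """Minimum byte size of each buffer, inferred the way a consumer would."""
    slots = offset + length
    bitmap = (slots + 7) // 8  # one validity bit per slot, rounded up to bytes
    if fmt == "i":  # int32: [validity bitmap, values]
        return [bitmap, 4 * slots]
    if fmt == "u":  # utf8: [validity bitmap, int32 offsets, data]
        offsets_size = 4 * (slots + 1)
        # The data size is only known by reading the last offset, that is,
        # by trusting that the offsets buffer really spans offsets_size bytes.
        (last,) = struct.unpack_from("<i", buffers[1], 4 * slots)
        return [bitmap, offsets_size, last]
    raise ValueError(f"unsupported format string: {fmt!r}")


offsets = struct.pack("<4i", 0, 2, 2, 7)
print(expected_buffer_sizes("i", length=10, offset=3, buffers=[]))
print(expected_buffer_sizes("u", length=3, offset=0,
                            buffers=[b"\x07", offsets, b"abcdefg"]))
```

Every size computed here is derived from values supplied by the producer, which is why the result cannot be checked against anything independent.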

github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Jan 26, 2026
pitrou force-pushed the gh48868-format-security-model branch 2 times, most recently from 59c98d4 to 40c916b, on January 26, 2026 17:12

pitrou (Member Author) commented Jan 26, 2026

I've addressed most review comments and expanded the document quite a bit:

  1. did a bit of rewording, for hopefully better clarity
  2. added a paragraph about invalid values (such as utf8 in a String array)
  3. added a section about deserialization of registered extension types
  4. added a section about robustness testing for implementations
  5. added a stub "non-Arrow formats" section

Another round of reviewing is welcome!

pitrou (Member Author) commented Jan 26, 2026

@github-actions crossbow submit preview-docs

@github-actions

Revision: 40c916b

Submitted crossbow builds: ursacomputing/crossbow @ actions-c3adcc06fb

Task Status
preview-docs GitHub Actions

github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on Jan 26, 2026
pitrou force-pushed the gh48868-format-security-model branch from 40c916b to 068cc96 on January 26, 2026 17:44
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Jan 26, 2026

pitrou force-pushed the gh48868-format-security-model branch from 068cc96 to 29ccd3b on January 27, 2026 11:07

pitrou (Member Author) commented Jan 27, 2026

@github-actions crossbow submit preview-docs

pitrou marked this pull request as ready for review on January 27, 2026 11:08
github-actions bot added the "awaiting change review" label on Feb 4, 2026

pitrou (Member Author) commented Feb 4, 2026

@github-actions crossbow submit preview-docs

github-actions bot commented Feb 4, 2026

Revision: 131d70f

Submitted crossbow builds: ursacomputing/crossbow @ actions-4dc8dcf3d1

Task Status
preview-docs GitHub Actions

pitrou force-pushed the gh48868-format-security-model branch from 131d70f to f269af7 on February 4, 2026 10:38

pitrou (Member Author) commented Feb 4, 2026

It looks like the preview-docs build now fails for an unrelated reason.

In any case, do reviewers agree that this is good enough to go, or do you think there should be further changes?

alamb (Contributor) left a comment

Thank you @pitrou and other reviewers

I think this document is a great addition to the Arrow documentation and will be a great reference to refer people to

I left some small wording suggestions, but I don't think any of them are required to merge this PR.

Comment on lines 41 to 48
1. *users* of Arrow: that is, developers of third-party libraries or applications
that use some of the Arrow formats or protocols by calling into Arrow libraries
as defined below;

2. *implementors* of Arrow libraries: that is, libraries that provide APIs
abstraction away from the details of the Arrow formats and protocols; such
libraries include the official Arrow implementations documented on
https://arrow.apache.org, but not only.
Contributor

Suggested change
- 1. *users* of Arrow: that is, developers of third-party libraries or applications
- that use some of the Arrow formats or protocols by calling into Arrow libraries
- as defined below;
- 2. *implementors* of Arrow libraries: that is, libraries that provide APIs
- abstraction away from the details of the Arrow formats and protocols; such
- libraries include the official Arrow implementations documented on
- https://arrow.apache.org, but not only.
+ 1. *users* of Arrow: that is, developers of third-party libraries or applications
+ that consume data using Arrow formats or protocols created by another application.
+ 2. *implementors* of Arrow libraries: that is, libraries that provide APIs
+ abstraction away from the details of the Arrow formats and protocols; such
+ libraries include, but are not limited to, the official Arrow implementations documented on
+ https://arrow.apache.org.

Member Author

Hmm, that changes the meaning quite a bit. I wanted to stress that "users" don't directly implement the Arrow specs, they use language-specific abstraction layers provided by an implementation. Perhaps I should just use these words :)

Member Author

I've tried to improve the wording a bit.

Contributor

The new wording looks good to me 👍

  1. users of Arrow: that is, developers of third-party libraries or applications
    that don't directly implement the Arrow formats or protocols, but
    instead call language-specific APIs provided by an Arrow library
    (as defined below);

Comment on lines +73 to +82
.. TODO:
For each layout, we should list the associated security risks and the recommended
steps to validate (perhaps in Columnar.rst)
Contributor

It seems like links to the invalid fuzz data have been added above

I personally think this PR is already hugely valuable, even without such a list.

Thus I suggest we merge this PR and get it published. We can file an issue to track adding type specific security risks as a follow on

A typical validation API must return a well-defined error, not crash, if the
given Arrow data is invalid; it must always be safe to execute regardless of
whether the data is valid or not.
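
The contract described here can be sketched as follows, using UTF-8 validity of string slots as the property being checked. The function name `validate_utf8_slots` and its return convention (None for valid data, a message otherwise) are assumptions chosen for illustration, not the API of any Arrow implementation. The point is that malformed offsets and invalid bytes both come back as ordinary error values instead of a crash.

```python
from typing import Optional


def validate_utf8_slots(offsets: list, data: bytes) -> Optional[str]:
    """Return None if every slot is valid UTF-8, else a description of the problem.

    Assumes `offsets` is a list of integers; bad values are reported, not raised.
    """
    if not offsets or offsets[0] < 0 or offsets[-1] > len(data):
        return "offsets out of bounds"
    for i in range(len(offsets) - 1):
        start, end = offsets[i], offsets[i + 1]
        if end < start:
            return f"slot {i}: negative length"
        try:
            data[start:end].decode("utf-8")
        except UnicodeDecodeError as exc:
            return f"slot {i}: invalid UTF-8 ({exc.reason})"
    return None


print(validate_utf8_slots([0, 2, 5], b"hi\xe2\x82\xac"))
print(validate_utf8_slots([0, 1, 2], b"\xff\xfe"))
```

A caller can then decide what to do with the message (reject the input, log it, and so on) without having to guard the validation call itself.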

Contributor

BTW, the way the Rust API works is that, by default, data read from IPC is explicitly validated (which is indeed quite expensive).

It is possible to turn this validation off via an unsafe API (the unsafe bit means the calling application has to explicitly disable the validation to acknowledge they are trusting the source):

https://docs.rs/arrow-ipc/57.2.0/arrow_ipc/reader/struct.FileDecoder.html#method.with_skip_validation

Advice for users
----------------

You should **never** consume a C Data Interface structure from an untrusted
Contributor

This is good and clear 👍

alamb requested a review from Alex-PLACET on February 5, 2026 11:42
github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label on Feb 5, 2026
pitrou force-pushed the gh48868-format-security-model branch from ebbe049 to 61a3e81 on February 5, 2026 13:30

pitrou (Member Author) commented Feb 5, 2026

@github-actions crossbow submit preview-docs

github-actions bot commented Feb 5, 2026

Revision: 61a3e81

Submitted crossbow builds: ursacomputing/crossbow @ actions-19f42b185d

Task Status
preview-docs GitHub Actions

github-actions bot added the "awaiting changes" label and removed the "awaiting merge" label on Feb 5, 2026
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Feb 5, 2026

paleolimbot (Member) left a comment

Thank you for these updates and putting this document together!

github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label on Feb 5, 2026
pitrou merged commit f39f275 into apache:main on Feb 5, 2026 (11 checks passed)
pitrou removed the "awaiting merge" label on Feb 5, 2026
pitrou deleted the gh48868-format-security-model branch on February 5, 2026 15:43

pitrou (Member Author) commented Feb 5, 2026

Thanks a lot for the reviews!
